Here we analyze the size of the different sources. We first count the number of distinct sources (i.e., the number of distinct strings in the table sources). Next, we calculate the average number of sentences taken from each source and list the largest sources.
The information can be used to get a first impression of the sources of a corpus.
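Sources may differ greatly in granularity. A corpus might contain some very large sources, such as a whole volume of a newspaper, alongside small sources such as single newspaper articles. If the two kinds are mixed, the resulting figures can be misleading. For example, one source with 10,000 sentences and 100 sources with 10 sentences each average about 109 sentences per source, a value typical of neither group.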
Table 1:
select count(*) from sources;
Table 2:
select round((select count(*) from sentences) / (select count(*) from sources),2);
Table 3:
select source, count(*) as cnt from sources s, inv_so i where i.so_id=s.so_id group by source order by cnt desc limit 20;
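The three queries can be tried out on a small toy database. The following is a minimal sketch in Python using the standard sqlite3 module, not the tooling actually used for the corpus. The table layout (sources with so_id and source, sentences with s_id and sentence, inv_so linking so_id to s_id) is an assumption inferred from the queries above, and the example.org source names and sentence counts are hypothetical. The helper names build_toy_corpus and source_stats are likewise made up for illustration. Because SQLite divides two integers as integers, the sketch multiplies by 1.0 in the query for Table 2; in MySQL the / operator already yields a decimal result.

```python
import sqlite3


def build_toy_corpus():
    """Create an in-memory database with an ASSUMED schema and hypothetical data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sources   (so_id INTEGER PRIMARY KEY, source   TEXT);
        CREATE TABLE sentences (s_id  INTEGER PRIMARY KEY, sentence TEXT);
        CREATE TABLE inv_so    (so_id INTEGER, s_id INTEGER);
        """
    )
    # Hypothetical sources and the number of sentences taken from each.
    sizes = {
        "http://example.org/a": 4,
        "http://example.org/b": 2,
        "http://example.org/c": 1,
    }
    s_id = 0
    for so_id, (source, n_sentences) in enumerate(sizes.items(), start=1):
        conn.execute("INSERT INTO sources VALUES (?, ?)", (so_id, source))
        for _ in range(n_sentences):
            s_id += 1
            conn.execute(
                "INSERT INTO sentences VALUES (?, ?)", (s_id, f"sentence {s_id}")
            )
            conn.execute("INSERT INTO inv_so VALUES (?, ?)", (so_id, s_id))
    return conn


def source_stats(conn, limit=20):
    """Return (number of sources, average sentences per source, largest sources)."""
    # Table 1: number of sources.
    n_sources = conn.execute("SELECT count(*) FROM sources").fetchone()[0]

    # Table 2: average number of sentences per source.
    # The factor 1.0 forces floating-point division in SQLite.
    average = conn.execute(
        "SELECT round(1.0 * (SELECT count(*) FROM sentences)"
        " / (SELECT count(*) FROM sources), 2)"
    ).fetchone()[0]

    # Table 3: the largest sources by number of linked sentences.
    largest = conn.execute(
        "SELECT source, count(*) AS cnt"
        " FROM sources s JOIN inv_so i ON i.so_id = s.so_id"
        " GROUP BY source ORDER BY cnt DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return n_sources, average, largest


if __name__ == "__main__":
    n_sources, average, largest = source_stats(build_toy_corpus())
    print("sources:", n_sources)
    print("average sentences per source:", average)
    for source, cnt in largest:
        print(cnt, source)
```

With the toy data, the largest source contributes more than half of all sentences, which illustrates why the average alone says little when source sizes are very uneven.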